System design is the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. It is a crucial skill for senior software engineers, architects, and anyone involved in building scalable, reliable, and maintainable software.
Scalability: The ability of a system to handle a growing amount of work by adding resources. We explore vertical scaling (adding power to one machine) vs. horizontal scaling (adding more machines) and the trade-offs involved.
Reliability: Ensuring the system performs its required functions under stated conditions for a specified period. Commonly measured by Mean Time Between Failures (MTBF).
Availability: The percentage of time a system is operational. High availability is often expressed in "nines" (e.g., 99.999%, or "five nines").
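To make the "nines" concrete, here is a minimal sketch that converts an availability fraction into allowed downtime per year (the constant and function names are illustrative):

```python
# Allowed annual downtime for a given availability fraction.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

for a in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{a:.3%} -> {downtime_minutes_per_year(a):8.2f} minutes/year")
```

Each additional nine cuts the downtime budget tenfold; five nines leaves only about five minutes of downtime per year.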
Performance: Measures the system's responsiveness, typically in terms of latency and throughput. We discuss how to optimize for both.
Monolith: A single, unified application. Simple to develop and deploy initially, but it can become complex and difficult to scale.
Microservices: An application built as a collection of loosely coupled, independently deployable services. Enhances scalability and team autonomy.
Serverless: An architecture where the cloud provider manages the server infrastructure and developers focus only on writing functions (e.g., AWS Lambda).
Event-driven: An architecture where services communicate through events. This promotes loose coupling and suits asynchronous workflows well.
Load balancing: Distributes incoming network traffic across multiple servers so that no single server becomes a bottleneck. Common algorithms include Round Robin and Least Connections.
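As a minimal sketch of the Round Robin algorithm (server addresses are illustrative; Least Connections would instead track live connection counts and pick the least-loaded server):

```python
import itertools

class RoundRobinBalancer:
    """Hand out backend servers in a fixed rotating order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
picks = [lb.next_server() for _ in range(4)]  # wraps back to the first server
```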
Caching: Stores frequently accessed data in a fast temporary storage layer to reduce latency and database load. Strategies include Cache-Aside, Read-Through, and Write-Back.
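The Cache-Aside pattern can be sketched in a few lines; here a dict stands in for the cache and `db_get` is a hypothetical stand-in for a real database query:

```python
cache = {}

def db_get(key):
    """Stand-in for a database query (illustrative)."""
    return f"value-for-{key}"

def get(key):
    if key in cache:        # cache hit: skip the database entirely
        return cache[key]
    value = db_get(key)     # cache miss: read from the source of truth
    cache[key] = value      # populate the cache for subsequent reads
    return value
```

The application, not the cache, is responsible for loading data on a miss; that is what distinguishes Cache-Aside from Read-Through.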
CDN (Content Delivery Network): A geographically distributed network of proxy servers that cache content closer to users, reducing latency for static assets.
Message queues: Enable asynchronous communication between services, helping to decouple them and absorb load spikes. Examples include RabbitMQ, Apache Kafka, and AWS SQS.
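The decoupling effect can be shown with Python's in-process `queue.Queue` (a toy stand-in for a real broker; message names are illustrative). The producer returns immediately, while a consumer processes messages at its own pace:

```python
import queue
import threading

q = queue.Queue()
results = []

def consumer():
    while True:
        msg = q.get()
        if msg is None:          # sentinel: shut down the worker
            break
        results.append(msg.upper())  # pretend "processing"
        q.task_done()

worker = threading.Thread(target=consumer)
worker.start()

for msg in ("order-created", "order-paid"):
    q.put(msg)                   # producer does not wait for processing
q.put(None)
worker.join()
```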
SQL vs. NoSQL: We compare relational (SQL) and non-relational (NoSQL) databases, discussing their data models, consistency guarantees (ACID vs. BASE), and use cases.
Sharding: A horizontal-scaling technique in which a database is partitioned into smaller, more manageable parts called shards. We cover sharding strategies such as range-based and hash-based partitioning.
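Hash-based sharding can be sketched as follows (shard count and key format are illustrative). A stable hash guarantees the same key always lands on the same shard:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a key to a shard via a stable hash of the key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Note that changing `NUM_SHARDS` remaps most keys, which is why production systems often use consistent hashing instead of a plain modulo.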
Replication: The process of creating and maintaining multiple copies of a database, improving availability and read performance. We discuss single-leader (master-slave) and multi-leader (master-master) replication.
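A common single-leader pattern is to route writes to the primary and spread reads across replicas. A minimal sketch, with illustrative names and a shared dict standing in for replicated state:

```python
import random

class ReplicatedDB:
    """Route writes to the primary; fan reads out across replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.data = {}  # pretend state already replicated to all copies

    def write(self, key, value):
        # All writes go through the single leader.
        self.data[key] = value
        return self.primary

    def read(self, key):
        # Reads hit any replica, scaling read throughput.
        return random.choice(self.replicas), self.data.get(key)
```

In a real system, replication lag means a replica may briefly serve stale reads; that trade-off is glossed over here.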
Let's walk through designing a simplified version of a social media feed like Twitter or Facebook. This involves making decisions about API design, data storage, and feed generation.
// High-level API endpoints
POST /v1/users/{userId}/posts (content, media_urls) -> postId
GET /v1/users/{userId}/feed?page_token=... -> {posts, next_page_token}
// Data Schema (Simplified NoSQL)
Users: { userId, name, following: [userIds] }
Posts: { postId, authorId, content, timestamp }
Feeds: { userId, postIds: [postId] } // Pre-computed feed
// Feed Generation
// 1. Fan-out on write: When a user posts, push the postId to the feeds of all their followers.
// - Pros: Fast feed reads.
// - Cons: Slow for users with many followers (celebrity problem).
// 2. Pull on read: When a user requests their feed, query all the people they follow and aggregate their recent posts.
// - Pros: No "celebrity problem" on write.
//    - Cons: Slow feed reads, since posts from everyone the user follows must be fetched and merged at request time.
// A hybrid approach is often used: fan-out on write for typical users, pull on read for high-follower ("celebrity") accounts.
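Fan-out on write, as described above, can be sketched against the simplified schema (user names and post IDs are illustrative):

```python
from collections import defaultdict

# Who follows whom, inverted from the Users.following field above.
following = {"alice": ["bob"], "carol": ["bob"]}
followers = defaultdict(list)
for user, followees in following.items():
    for followee in followees:
        followers[followee].append(user)

feeds = defaultdict(list)  # userId -> [postId], newest first

def publish(author, post_id):
    """Fan-out on write: push the new post into every follower's feed."""
    for follower in followers[author]:
        feeds[follower].insert(0, post_id)

publish("bob", "p1")  # both of bob's followers now see p1 instantly
```

The cost is visible in the loop: an author with millions of followers triggers millions of feed writes per post, which is exactly the celebrity problem.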
System design interviews are about demonstrating your ability to think through a complex problem and make reasonable trade-offs. Here is a framework to approach them:
1. Clarify requirements: Understand the functional (e.g., post a tweet) and non-functional (e.g., low latency, high availability) requirements. Ask about scale (e.g., number of daily active users).
2. High-level design: Draw a high-level architecture diagram with the main components (e.g., clients, API gateway, services, databases). Identify the data flow.
3. Deep dive: Choose a specific component and design it in detail. This could be the database schema, API design, or caching strategy. Discuss trade-offs.
4. Bottlenecks and scaling: Discuss potential bottlenecks and how to address them. This includes scaling databases, handling traffic spikes, and ensuring data consistency.